Part I - (Analysis of Loan Data from Prosper)¶

by (Yusuf Sanni)¶

Introduction¶

The dataset for this exercise contains 113,937 loans with 81 variables on each loan. The details of each variable will be shown below. The idea behind this data analysis is to show if there are any correlations between the variables which can help in showing lenders' behaviours and loan applications. Questions like what factors affects a loan's outcome status, What affects the borrower’s APR or interest rate will be answered with the use of univariate, bivariate and multivariate exploratory data means. That is, after some preliminary wrangling, of course. The key for the data dictonary is shown below

  • ListingKey:Unique key for each listing, same value as the 'key' used in the listing object in the API.
  • ListingNumber: The number that uniquely identifies the listing to the public as displayed on the website.
  • ListingCreationDate: The date the listing was created.
  • CreditGrade: The Credit rating that was assigned at the time the listing went live. Applicable for listings pre-2009 period and will only be populated for those listings.
  • Term: The length of the loan expressed in months.
  • LoanStatus: The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
  • ClosedDate: Closed date is applicable for Cancelled, Completed, Chargedoff and Defaulted loan statuses.
  • BorrowerAPR: The Borrower's Annual Percentage Rate (APR) for the loan.
  • BorrowerRate: The Borrower's interest rate for this loan.
  • LenderYield: The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.
  • EstimatedEffectiveYield: Effective yield is equal to the borrower interest rate (i) minus the servicing fee rate, (ii) minus estimated uncollected interest on charge-offs, (iii) plus estimated collected late fees. Applicable for loans originated after July 2009.
  • EstimatedLoss: Estimated loss is the estimated principal loss on charge-offs. Applicable for loans originated after July 2009.
  • EstimatedReturn: The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate. Applicable for loans originated after July 2009.
  • ProsperRating (numeric): The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.
  • ProsperRating (Alpha): The Prosper Rating assigned at the time the listing was created between AA - HR. Applicable for loans originated after July 2009.
  • ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-11, with 11 being the best, or lowest risk score. Applicable for loans originated after July 2009.
  • ListingCategory: The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
  • BorrowerState: The two letter abbreviation of the state of the address of the borrower at the time the Listing was created.
  • Occupation: The Occupation selected by the Borrower at the time they created the listing.
  • EmploymentStatus: The employment status of the borrower at the time they posted the listing.
  • EmploymentStatusDuration: The length in months of the employment status at the time the listing was created.
  • IsBorrowerHomeowner: A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
  • CurrentlyInGroup: Specifies whether or not the Borrower was in a group at the time the listing was created.
  • GroupKey: The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation.
  • DateCreditPulled: The date the credit profile was pulled.
  • CreditScoreRangeLower: The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
  • CreditScoreRangeUpper: The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency.
  • FirstRecordedCreditLine: The date the first credit line was opened.
  • CurrentCreditLines: Number of current credit lines at the time the credit profile was pulled.
  • OpenCreditLines: Number of open credit lines at the time the credit profile was pulled.
  • TotalCreditLinespast7years: Number of credit lines in the past seven years at the time the credit profile was pulled.
  • OpenRevolvingAccounts: Number of open revolving accounts at the time the credit profile was pulled.
  • OpenRevolvingMonthlyPayment: Monthly payment on revolving accounts at the time the credit profile was pulled.
  • InquiriesLast6Months: Number of inquiries in the past six months at the time the credit profile was pulled.
  • TotalInquiries: Total number of inquiries at the time the credit profile was pulled.
  • CurrentDelinquencies: Number of accounts delinquent at the time the credit profile was pulled.
  • AmountDelinquent: Dollars delinquent at the time the credit profile was pulled.
  • DelinquenciesLast7Years: Number of delinquencies in the past 7 years at the time the credit profile was pulled.
  • PublicRecordsLast10Years: Number of public records in the past 10 years at the time the credit profile was pulled.
  • PublicRecordsLast12Months: Number of public records in the past 12 months at the time the credit profile was pulled.
  • RevolvingCreditBalance: Dollars of revolving credit at the time the credit profile was pulled.
  • BankcardUtilization: The percentage of available revolving credit that is utilized at the time the credit profile was pulled.
  • AvailableBankcardCredit: The total available credit via bank card at the time the credit profile was pulled.
  • TotalTrades: Number of trade lines ever opened at the time the credit profile was pulled.
  • TradesNeverDelinquent: Number of trades that have never been delinquent at the time the credit profile was pulled.
  • TradesOpenedLast6Months: Number of trades opened in the last 6 months at the time the credit profile was pulled.
  • DebtToIncomeRatio: The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
  • IncomeRange: The income range of the borrower at the time the listing was created.
  • IncomeVerifiable: The borrower indicated they have the required documentation to support their income.
  • StatedMonthlyIncome: The monthly income the borrower stated at the time the listing was created.
  • LoanKey: Unique key for each loan. This is the same key that is used in the API.
  • TotalProsperLoans: Number of Prosper loans the borrower at the time they created this listing. This value will be null if the borrower had no prior loans.
  • TotalProsperPaymentsBilled: Number of on time payments the borrower made on Prosper loans at the time they created this listing. This value will be null if the borrower had no prior loans.
  • OnTimeProsperPayments: Number of on time payments the borrower had made on Prosper loans at the time they created this listing. This value will be null if the borrower has no prior loans.
  • ProsperPaymentsLessThanOneMonthLate: Number of payments the borrower made on Prosper loans that were less than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
  • ProsperPaymentsOneMonthPlusLate: Number of payments the borrower made on Prosper loans that were greater than one month late at the time they created this listing. This value will be null if the borrower had no prior loans.
  • ProsperPrincipalBorrowed: Total principal borrowed on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
  • ProsperPrincipalOutstanding: Principal outstanding on Prosper loans at the time the listing was created. This value will be null if the borrower had no prior loans.
  • ScorexChangeAtTimeOfListing: Borrower's credit score change at the time the credit profile was pulled. This will be the change relative to the borrower's last Prosper loan. This value will be null if the borrower had no prior loans.
  • LoanCurrentDaysDelinquent: The number of days delinquent.
  • LoanFirstDefaultedCycleNumber: The cycle the loan was charged off. If the loan has not charged off the value will be null.
  • LoanMonthsSinceOrigination: Number of months since the loan originated.
  • LoanNumber: Unique numeric value associated with the loan.
  • LoanOriginalAmount: The origination amount of the loan.
  • LoanOriginationDate: The date the loan was originated.
  • LoanOriginationQuarter: The quarter in which the loan was originated.
  • MemberKey: The unique key that is associated with the borrower. This is the same identifier that is used in the API member object.
  • MonthlyLoanPayment: The scheduled monthly loan payment.
  • LP_CustomerPayments: Pre charge-off cumulative gross payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
  • LP_CustomerPrincipalPayments: Pre charge-off cumulative principal payments made by the borrower on the loan. If the loan has charged off, this value will exclude any recoveries.
  • LP_InterestandFees: Pre charge-off cumulative interest and fees paid by the borrower. If the loan has charged off, this value will exclude any recoveries.
  • LP_ServiceFees: Cumulative service fees paid by the investors who have invested in the loan.
  • LP_CollectionFees: Cumulative collection fees paid by the investors who have invested in the loan.
  • LP_GrossPrincipalLoss: The gross charged off amount of the loan.
  • LP_NetPrincipalLoss: The principal that remains uncollected after any recoveries.
  • LP_NonPrincipalRecoverypayments: The interest and fee component of any recovery payments. The current payment policy applies payments in the following order: Fees, interest, principal.
  • PercentFunded: Percent the listing was funded.
  • Recommendations: Number of recommendations the borrower had at the time the listing was created.
  • InvestmentFromFriendsCount: Number of friends that made an investment in the loan.
  • InvestmentFromFriendsAmount: Dollar amount of investments that were made by friends.
  • Investors: The number of investors that funded the loan.

Rubric Tip: Your code should not generate any errors, and should use functions, loops where possible to reduce repetitive code. Prefer to use functions to reuse code statements.

Rubric Tip: Document your approach and findings in markdown cells. Use comments and docstrings in code cells to document the code functionality.

Rubric Tip: Markup cells should have headers and text that organize your thoughts, findings, and what you plan on investigating next.

Preliminary Wrangling¶

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

In [2]:
loans = pd.read_csv('prosperLoanData.csv')
loans
Out[2]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113932 E6D9357655724827169606C 753087 2013-04-14 05:55:02.663000000 NaN 36 Current NaN 0.22354 0.1864 0.1764 ... -75.58 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
113933 E6DB353036033497292EE43 537216 2011-11-03 20:42:55.333000000 NaN 36 FinalPaymentInProgress NaN 0.13220 0.1110 0.1010 ... -30.05 0.0 0.0 0.0 0.0 1.0 0 0 0.0 22
113934 E6E13596170052029692BB1 1069178 2013-12-13 05:49:12.703000000 NaN 60 Current NaN 0.23984 0.2150 0.2050 ... -16.91 0.0 0.0 0.0 0.0 1.0 0 0 0.0 119
113935 E6EB3531504622671970D9E 539056 2011-11-14 13:18:26.597000000 NaN 60 Completed 2013-08-13 00:00:00 0.28408 0.2605 0.2505 ... -235.05 0.0 0.0 0.0 0.0 1.0 0 0 0.0 274
113936 E6ED3600409833199F711B7 1140093 2014-01-15 09:27:37.657000000 NaN 36 Current NaN 0.13189 0.1039 0.0939 ... -1.70 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1

113937 rows × 81 columns

In [3]:
loans.describe()
Out[3]:
ListingNumber Term BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperScore ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
count 1.139370e+05 113937.000000 113912.000000 113937.000000 113937.000000 84853.000000 84853.000000 84853.000000 84853.000000 84853.000000 ... 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000 113937.000000
mean 6.278857e+05 40.830248 0.218828 0.192764 0.182701 0.168661 0.080306 0.096068 4.072243 5.950067 ... -54.725641 -14.242698 700.446342 681.420499 25.142686 0.998584 0.048027 0.023460 16.550751 80.475228
std 3.280762e+05 10.436212 0.080364 0.074818 0.074516 0.068467 0.046764 0.030403 1.673227 2.376501 ... 60.675425 109.232758 2388.513831 2357.167068 275.657937 0.017919 0.332353 0.232412 294.545422 103.239020
min 4.000000e+00 12.000000 0.006530 0.000000 -0.010000 -0.182700 0.004900 -0.182700 1.000000 1.000000 ... -664.870000 -9274.750000 -94.200000 -954.550000 0.000000 0.700000 0.000000 0.000000 0.000000 1.000000
25% 4.009190e+05 36.000000 0.156290 0.134000 0.124200 0.115670 0.042400 0.074080 3.000000 4.000000 ... -73.180000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 2.000000
50% 6.005540e+05 36.000000 0.209760 0.184000 0.173000 0.161500 0.072400 0.091700 4.000000 6.000000 ... -34.440000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 44.000000
75% 8.926340e+05 36.000000 0.283810 0.250000 0.240000 0.224300 0.112000 0.116600 5.000000 8.000000 ... -13.920000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 115.000000
max 1.255725e+06 60.000000 0.512290 0.497500 0.492500 0.319900 0.366000 0.283700 7.000000 11.000000 ... 32.060000 0.000000 25000.000000 25000.000000 21117.900000 1.012500 39.000000 33.000000 25000.000000 1189.000000

8 rows × 61 columns

In [4]:
loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object 
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64  
 17  BorrowerState                        108422 non-null  object 
 18  Occupation                           110349 non-null  object 
 19  EmploymentStatus                     111682 non-null  object 
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool   
 22  CurrentlyInGroup                     113937 non-null  bool   
 23  GroupKey                             13341 non-null   object 
 24  DateCreditPulled                     113937 non-null  object 
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object 
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64  
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object 
 48  IncomeVerifiable                     113937 non-null  bool   
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object 
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64  
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64  
 62  LoanNumber                           113937 non-null  int64  
 63  LoanOriginalAmount                   113937 non-null  int64  
 64  LoanOriginationDate                  113937 non-null  object 
 65  LoanOriginationQuarter               113937 non-null  object 
 66  MemberKey                            113937 non-null  object 
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64  
 78  InvestmentFromFriendsCount           113937 non-null  int64  
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64  
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
In [5]:
loans_copy = loans.copy()
In [6]:
loans_copy.head()
Out[6]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 ... -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 ... 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 ... -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 ... -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 ... -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20

5 rows × 81 columns

In [7]:
loans_copy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113937 entries, 0 to 113936
Data columns (total 81 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   ListingKey                           113937 non-null  object 
 1   ListingNumber                        113937 non-null  int64  
 2   ListingCreationDate                  113937 non-null  object 
 3   CreditGrade                          28953 non-null   object 
 4   Term                                 113937 non-null  int64  
 5   LoanStatus                           113937 non-null  object 
 6   ClosedDate                           55089 non-null   object 
 7   BorrowerAPR                          113912 non-null  float64
 8   BorrowerRate                         113937 non-null  float64
 9   LenderYield                          113937 non-null  float64
 10  EstimatedEffectiveYield              84853 non-null   float64
 11  EstimatedLoss                        84853 non-null   float64
 12  EstimatedReturn                      84853 non-null   float64
 13  ProsperRating (numeric)              84853 non-null   float64
 14  ProsperRating (Alpha)                84853 non-null   object 
 15  ProsperScore                         84853 non-null   float64
 16  ListingCategory (numeric)            113937 non-null  int64  
 17  BorrowerState                        108422 non-null  object 
 18  Occupation                           110349 non-null  object 
 19  EmploymentStatus                     111682 non-null  object 
 20  EmploymentStatusDuration             106312 non-null  float64
 21  IsBorrowerHomeowner                  113937 non-null  bool   
 22  CurrentlyInGroup                     113937 non-null  bool   
 23  GroupKey                             13341 non-null   object 
 24  DateCreditPulled                     113937 non-null  object 
 25  CreditScoreRangeLower                113346 non-null  float64
 26  CreditScoreRangeUpper                113346 non-null  float64
 27  FirstRecordedCreditLine              113240 non-null  object 
 28  CurrentCreditLines                   106333 non-null  float64
 29  OpenCreditLines                      106333 non-null  float64
 30  TotalCreditLinespast7years           113240 non-null  float64
 31  OpenRevolvingAccounts                113937 non-null  int64  
 32  OpenRevolvingMonthlyPayment          113937 non-null  float64
 33  InquiriesLast6Months                 113240 non-null  float64
 34  TotalInquiries                       112778 non-null  float64
 35  CurrentDelinquencies                 113240 non-null  float64
 36  AmountDelinquent                     106315 non-null  float64
 37  DelinquenciesLast7Years              112947 non-null  float64
 38  PublicRecordsLast10Years             113240 non-null  float64
 39  PublicRecordsLast12Months            106333 non-null  float64
 40  RevolvingCreditBalance               106333 non-null  float64
 41  BankcardUtilization                  106333 non-null  float64
 42  AvailableBankcardCredit              106393 non-null  float64
 43  TotalTrades                          106393 non-null  float64
 44  TradesNeverDelinquent (percentage)   106393 non-null  float64
 45  TradesOpenedLast6Months              106393 non-null  float64
 46  DebtToIncomeRatio                    105383 non-null  float64
 47  IncomeRange                          113937 non-null  object 
 48  IncomeVerifiable                     113937 non-null  bool   
 49  StatedMonthlyIncome                  113937 non-null  float64
 50  LoanKey                              113937 non-null  object 
 51  TotalProsperLoans                    22085 non-null   float64
 52  TotalProsperPaymentsBilled           22085 non-null   float64
 53  OnTimeProsperPayments                22085 non-null   float64
 54  ProsperPaymentsLessThanOneMonthLate  22085 non-null   float64
 55  ProsperPaymentsOneMonthPlusLate      22085 non-null   float64
 56  ProsperPrincipalBorrowed             22085 non-null   float64
 57  ProsperPrincipalOutstanding          22085 non-null   float64
 58  ScorexChangeAtTimeOfListing          18928 non-null   float64
 59  LoanCurrentDaysDelinquent            113937 non-null  int64  
 60  LoanFirstDefaultedCycleNumber        16952 non-null   float64
 61  LoanMonthsSinceOrigination           113937 non-null  int64  
 62  LoanNumber                           113937 non-null  int64  
 63  LoanOriginalAmount                   113937 non-null  int64  
 64  LoanOriginationDate                  113937 non-null  object 
 65  LoanOriginationQuarter               113937 non-null  object 
 66  MemberKey                            113937 non-null  object 
 67  MonthlyLoanPayment                   113937 non-null  float64
 68  LP_CustomerPayments                  113937 non-null  float64
 69  LP_CustomerPrincipalPayments         113937 non-null  float64
 70  LP_InterestandFees                   113937 non-null  float64
 71  LP_ServiceFees                       113937 non-null  float64
 72  LP_CollectionFees                    113937 non-null  float64
 73  LP_GrossPrincipalLoss                113937 non-null  float64
 74  LP_NetPrincipalLoss                  113937 non-null  float64
 75  LP_NonPrincipalRecoverypayments      113937 non-null  float64
 76  PercentFunded                        113937 non-null  float64
 77  Recommendations                      113937 non-null  int64  
 78  InvestmentFromFriendsCount           113937 non-null  int64  
 79  InvestmentFromFriendsAmount          113937 non-null  float64
 80  Investors                            113937 non-null  int64  
dtypes: bool(3), float64(50), int64(11), object(17)
memory usage: 68.1+ MB
In [8]:
# check for duplicates
loans_copy[loans_copy.duplicated()]
Out[8]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield ... LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors

0 rows × 81 columns

In [9]:
loans_copy.Term.value_counts()
Out[9]:
36    87778
60    24545
12     1614
Name: Term, dtype: int64
In [10]:
loans_copy.Term.hist()
Out[10]:
<AxesSubplot:>

Change the loan term from numbers to categorical terms¶

In [11]:
# Since there are only 3 loan term periods, for clarity, we will change the periods into short term, medium term and long term
# and then convert the data type to categorical
loans_copy.Term = loans_copy.Term.map({12:'Short Term', 36:'Medium Term', 60:'Long Term'})
In [12]:
loans_copy.Term.value_counts()
Out[12]:
Medium Term    87778
Long Term      24545
Short Term      1614
Name: Term, dtype: int64
In [13]:
loans_copy.Term.hist()
Out[13]:
<AxesSubplot:>
In [14]:
# Change data type of loan term to category type
loans_copy.Term = loans_copy.Term.astype('category')

Loan Status¶

In [15]:
loans_copy.LoanStatus.value_counts()
Out[15]:
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due (1-15 days)        806
Past Due (31-60 days)       363
Past Due (61-90 days)       313
Past Due (91-120 days)      304
Past Due (16-30 days)       265
FinalPaymentInProgress      205
Past Due (>120 days)         16
Cancelled                     5
Name: LoanStatus, dtype: int64
In [16]:
loans_copy.LoanStatus.hist()
plt.xticks(rotation=90);
In [17]:
# Keep all past due loans under one heading for easier viewing
loans_copy.LoanStatus = loans_copy.LoanStatus.replace({'Past Due (1-15 days)':'Past Due', 'Past Due (16-30 days)':'Past Due',
                  'Past Due (31-60 days)':'Past Due', 'Past Due (61-90 days)' :'Past Due','Past Due (91-120 days)' :'Past Due', 
                                                       'Past Due (>120 days)' :'Past Due'})
In [18]:
loans_copy.LoanStatus.value_counts()
Out[18]:
Current                   56576
Completed                 38074
Chargedoff                11992
Defaulted                  5018
Past Due                   2067
FinalPaymentInProgress      205
Cancelled                     5
Name: LoanStatus, dtype: int64
In [19]:
# Change data type of LoanStatus to category
In [20]:
loans_copy.LoanStatus = loans_copy.LoanStatus.astype('category')
In [21]:
loans_copy.LoanStatus.hist()
plt.xticks(rotation=90);

Credit Grade¶

In [22]:
loans_copy.CreditGrade.value_counts()
Out[22]:
C     5649
D     5153
B     4389
AA    3509
HR    3508
A     3315
E     3289
NC     141
Name: CreditGrade, dtype: int64
In [23]:
# Change datatype to category
loans_copy.CreditGrade = loans_copy.CreditGrade.astype('category')
In [24]:
sns.countplot(data=loans_copy, x='CreditGrade')
Out[24]:
<AxesSubplot:xlabel='CreditGrade', ylabel='count'>

BorrowerRate¶

In [25]:
loans_copy.BorrowerRate.value_counts()
Out[25]:
0.3177    3672
0.3500    1905
0.3199    1651
0.2900    1508
0.2699    1319
          ... 
0.2201       1
0.0752       1
0.1416       1
0.2812       1
0.0739       1
Name: BorrowerRate, Length: 2294, dtype: int64
In [26]:
# Change borrower rate to float data type
loans_copy.BorrowerRate = loans_copy.BorrowerRate.astype('float')
In [27]:
loans_copy.BorrowerRate.describe()
Out[27]:
count    113937.000000
mean          0.192764
std           0.074818
min           0.000000
25%           0.134000
50%           0.184000
75%           0.250000
max           0.497500
Name: BorrowerRate, dtype: float64
In [28]:
sns.countplot(data=loans_copy, x='BorrowerRate')
Out[28]:
<AxesSubplot:xlabel='BorrowerRate', ylabel='count'>

Listing Category¶

In [ ]:
 

Annual Percentage Rate¶

In [29]:
loans_copy.BorrowerAPR.value_counts()
Out[29]:
0.35797    3672
0.35643    1644
0.37453    1260
0.30532     902
0.29510     747
           ... 
0.37266       1
0.27518       1
0.18477       1
0.29961       1
0.19543       1
Name: BorrowerAPR, Length: 6677, dtype: int64
In [30]:
# Change data type of APR to float
loans_copy.BorrowerAPR = loans_copy.BorrowerAPR.astype('float')
In [31]:
sns.countplot(data=loans_copy, x='BorrowerAPR')
Out[31]:
<AxesSubplot:xlabel='BorrowerAPR', ylabel='count'>

Prosper Score¶

In [32]:
loans_copy.ProsperScore.value_counts()
Out[32]:
4.0     12595
6.0     12278
8.0     12053
7.0     10597
5.0      9813
3.0      7642
9.0      6911
2.0      5766
10.0     4750
11.0     1456
1.0       992
Name: ProsperScore, dtype: int64
In [33]:
#change data type to category
loans_copy.ProsperScore = loans_copy.ProsperScore.astype('category')
In [34]:
loans_copy.ProsperScore.dtype
Out[34]:
CategoricalDtype(categories=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0], ordered=False)
In [35]:
sns.countplot(data=loans_copy, x='ProsperScore')
Out[35]:
<AxesSubplot:xlabel='ProsperScore', ylabel='count'>

Borrower State¶

In [36]:
loans_copy.BorrowerState.value_counts()
Out[36]:
CA    14717
TX     6842
NY     6729
FL     6720
IL     5921
GA     5008
OH     4197
MI     3593
VA     3278
NJ     3097
NC     3084
WA     3048
PA     2972
MD     2821
MO     2615
MN     2318
MA     2242
CO     2210
IN     2078
AZ     1901
WI     1842
OR     1817
TN     1737
AL     1679
CT     1627
SC     1122
NV     1090
KS     1062
KY      983
OK      971
LA      954
UT      877
AR      855
MS      787
NE      674
ID      599
NH      551
NM      472
RI      435
HI      409
WV      391
DC      382
MT      330
DE      300
VT      207
AK      200
SD      189
IA      186
WY      150
ME      101
ND       52
Name: BorrowerState, dtype: int64
In [37]:
# Find number of null values present in state abbreviations
loans_copy.BorrowerState.isnull().sum()
Out[37]:
5515
In [38]:
# Percentage of missing states
loans_copy.BorrowerState.isnull().mean() * 100
Out[38]:
4.840394252964358
In [39]:
# Replace the null values with the phrase "Not Specified" and convert the column to category data type
loans_copy.BorrowerState.fillna('Not Specified', inplace = True)
In [40]:
loans_copy.BorrowerState = loans_copy.BorrowerState.astype('category')
In [41]:
loans_copy.BorrowerState.dtype
Out[41]:
CategoricalDtype(categories=['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL',
                  'GA', 'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA',
                  'MD', 'ME', 'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE',
                  'NH', 'NJ', 'NM', 'NV', 'NY', 'Not Specified', 'OH', 'OK',
                  'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VA', 'VT',
                  'WA', 'WI', 'WV', 'WY'],
, ordered=False)
In [42]:
# Replace the abbreviations with names for clarity using a dictonary
# First, create a dictionary of the names you would like to create
full_state_names =  {'WA': 'WASHINGTON', 'VA': 'VIRGINIA', 'DE': 'DELAWARE', 'DC': 'DISTRICT OF COLUMBIA', 'WI': 'WISCONSIN', 
                     'WV': 'WEST VIRGINIA','HI': 'HAWAII', 'FL': 'FLORIDA', 'WY': 'WYOMING', 'NH': 'NEW HAMPSHIRE', 
                     'NJ': 'NEW JERSEY', 'NM': 'NEW MEXICO','TX': 'TEXAS', 'LA': 'LOUISIANA', 'NC': 'NORTH CAROLINA', 
                     'ND': 'NORTH DAKOTA', 'NE': 'NEBRASKA', 'TN': 'TENNESSEE','NY': 'NEW YORK', 'PA': 'PENNSYLVANIA', 
                     'CA': 'CALIFORNIA','NV': 'NEVADA', 'CO': 'COLORADO', 'VI': 'VIRGIN ISLANDS', 'AK': 'ALASKA', 
                     'AL': 'ALABAMA',  'AR': 'ARKANSAS', 'VT': 'VERMONT', 'IL': 'ILLINOIS', 'GA': 'GEORGIA', 'IN': 'INDIANA',
                     'IA': 'IOWA', 'OK': 'OKLAHOMA', 'AZ': 'ARIZONA', 'ID': 'IDAHO', 'CT': 'CONNECTICUT', 'ME': 'MAINE',
                     'MD': 'MARYLAND', 'MA': 'MASSACHUSETTS','OH': 'OHIO', 'UT': 'UTAH', 'MO': 'MISSOURI','MN': 'MINNESOTA', 
                     'MI': 'MICHIGAN', 'RI': 'RHODE ISLAND','KS': 'KANSAS', 'MT': 'MONTANA', 'MS': 'MISSISSIPPI', 
                     'SC': 'SOUTH CAROLINA', 'KY': 'KENTUCKY', 'OR': 'OREGON', 'SD': 'SOUTH DAKOTA', 
                     'Not Specified': 'NOT SPECIFIED' }
In [43]:
loans_copy.BorrowerState = loans_copy.BorrowerState.map(full_state_names)
In [44]:
loans_copy.BorrowerState.value_counts()
Out[44]:
CALIFORNIA              14717
TEXAS                    6842
NEW YORK                 6729
FLORIDA                  6720
ILLINOIS                 5921
NOT SPECIFIED            5515
GEORGIA                  5008
OHIO                     4197
MICHIGAN                 3593
VIRGINIA                 3278
NEW JERSEY               3097
NORTH CAROLINA           3084
WASHINGTON               3048
PENNSYLVANIA             2972
MARYLAND                 2821
MISSOURI                 2615
MINNESOTA                2318
MASSACHUSETTS            2242
COLORADO                 2210
INDIANA                  2078
ARIZONA                  1901
WISCONSIN                1842
OREGON                   1817
TENNESSEE                1737
ALABAMA                  1679
CONNECTICUT              1627
SOUTH CAROLINA           1122
NEVADA                   1090
KANSAS                   1062
KENTUCKY                  983
OKLAHOMA                  971
LOUISIANA                 954
UTAH                      877
ARKANSAS                  855
MISSISSIPPI               787
NEBRASKA                  674
IDAHO                     599
NEW HAMPSHIRE             551
NEW MEXICO                472
RHODE ISLAND              435
HAWAII                    409
WEST VIRGINIA             391
DISTRICT OF COLUMBIA      382
MONTANA                   330
DELAWARE                  300
VERMONT                   207
ALASKA                    200
SOUTH DAKOTA              189
IOWA                      186
WYOMING                   150
MAINE                     101
NORTH DAKOTA               52
Name: BorrowerState, dtype: int64
In [45]:
type_order = loans_copy.BorrowerState.value_counts().index
plt.figure(figsize=[20,5])
sns.countplot(data=loans_copy, x='BorrowerState', order=type_order)
plt.xticks(rotation=90);

Occupation¶

In [46]:
loans_copy.Occupation.value_counts()
Out[46]:
Other                          28617
Professional                   13628
Computer Programmer             4478
Executive                       4311
Teacher                         3759
                               ...  
Dentist                           68
Student - College Freshman        41
Student - Community College       28
Judge                             22
Student - Technical School        16
Name: Occupation, Length: 67, dtype: int64
In [47]:
# change data type to categorical
loans_copy.Occupation = loans_copy.Occupation.astype('category')
In [48]:
loans_copy.Occupation.dtype
Out[48]:
CategoricalDtype(categories=['Accountant/CPA', 'Administrative Assistant', 'Analyst',
                  'Architect', 'Attorney', 'Biologist', 'Bus Driver',
                  'Car Dealer', 'Chemist', 'Civil Service', 'Clergy',
                  'Clerical', 'Computer Programmer', 'Construction', 'Dentist',
                  'Doctor', 'Engineer - Chemical', 'Engineer - Electrical',
                  'Engineer - Mechanical', 'Executive', 'Fireman',
                  'Flight Attendant', 'Food Service',
                  'Food Service Management', 'Homemaker', 'Investor', 'Judge',
                  'Laborer', 'Landscaping', 'Medical Technician',
                  'Military Enlisted', 'Military Officer', 'Nurse (LPN)',
                  'Nurse (RN)', 'Nurse's Aide', 'Other', 'Pharmacist',
                  'Pilot - Private/Commercial',
                  'Police Officer/Correction Officer', 'Postal Service',
                  'Principal', 'Professional', 'Professor', 'Psychologist',
                  'Realtor', 'Religious', 'Retail Management',
                  'Sales - Commission', 'Sales - Retail', 'Scientist',
                  'Skilled Labor', 'Social Worker',
                  'Student - College Freshman',
                  'Student - College Graduate Student',
                  'Student - College Junior', 'Student - College Senior',
                  'Student - College Sophomore', 'Student - Community College',
                  'Student - Technical School', 'Teacher', 'Teacher's Aide',
                  'Tradesman - Carpenter', 'Tradesman - Electrician',
                  'Tradesman - Mechanic', 'Tradesman - Plumber',
                  'Truck Driver', 'Waiter/Waitress'],
, ordered=False)

Employment Status¶

In [49]:
loans_copy.EmploymentStatus.value_counts()
Out[49]:
Employed         67322
Full-time        26355
Self-employed     6134
Not available     5347
Other             3806
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64
In [50]:
# Change data type to category
loans_copy.EmploymentStatus = loans_copy.EmploymentStatus.astype('category')
In [51]:
type_order = loans_copy.EmploymentStatus.value_counts().index
sns.countplot(data=loans_copy, y='EmploymentStatus', order = type_order)
Out[51]:
<AxesSubplot:xlabel='count', ylabel='EmploymentStatus'>

Income Range¶

In [52]:
loans_copy.IncomeRange.value_counts()
Out[52]:
$25,000-49,999    32192
$50,000-74,999    31050
$100,000+         17337
$75,000-99,999    16916
Not displayed      7741
$1-24,999          7274
Not employed        806
$0                  621
Name: IncomeRange, dtype: int64
In [53]:
loans_copy.IncomeRange.describe()
Out[53]:
count             113937
unique                 8
top       $25,000-49,999
freq               32192
Name: IncomeRange, dtype: object
In [54]:
type_order = loans_copy.IncomeRange.value_counts().index
sns.countplot(data=loans_copy, y='IncomeRange', order=type_order, color='blue')
Out[54]:
<AxesSubplot:xlabel='count', ylabel='IncomeRange'>

IncomeVerifiable¶

In [55]:
loans_copy.IncomeVerifiable.value_counts()
Out[55]:
True     105268
False      8669
Name: IncomeVerifiable, dtype: int64
In [56]:
loans_copy.IncomeVerifiable.dtype
Out[56]:
dtype('bool')
In [57]:
sns.countplot(data=loans_copy,x='IncomeVerifiable')
Out[57]:
<AxesSubplot:xlabel='IncomeVerifiable', ylabel='count'>
In [58]:
# Structure of the dataframe
loans_copy.shape
Out[58]:
(113937, 81)

Is Borrower a HomeOwner¶

In [59]:
loans_copy.IsBorrowerHomeowner.value_counts()
Out[59]:
True     57478
False    56459
Name: IsBorrowerHomeowner, dtype: int64
In [60]:
loans_copy.IsBorrowerHomeowner.dtype
Out[60]:
dtype('bool')
In [61]:
sns.countplot(data=loans_copy, x='IsBorrowerHomeowner')
Out[61]:
<AxesSubplot:xlabel='IsBorrowerHomeowner', ylabel='count'>

What is the distribution of the Prosper Scores ?¶

In [62]:
sns.countplot(data=loans_copy, x='ProsperScore')
Out[62]:
<AxesSubplot:xlabel='ProsperScore', ylabel='count'>
In [ ]:
 
In [ ]:
 

What is the structure of your dataset?¶

The dataset structure is made up of 113937 rows and 81 columns

What is/are the main feature(s) of interest in your dataset?¶

The main features of interest are the loan terms, the loan status, the interest and APR and what factors determine the allocation of loans as well as payment of loans.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

The features in the dataset that will help support the investigation are the loan terms, the loan status, the credit grade, the interest and APR, the borrower state, the occupation of the borrower, the employment status, the income range and the verifiable income of the borrower.

Univariate Exploration¶

In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

In [63]:
# Create a function for drawing of countplots
def draw_countplot(x, title):
    plt.figure(figsize=(10,5))
    visual = sns.countplot(x=x,data=loans_copy, color='blue')
    for txt in visual.patches:
        visual.annotate('{:.0f}'.format(txt.get_height()), (txt.get_x()+0, txt.get_height()+0.05))
    plt.suptitle(title)
    plt.xticks(rotation=90)
    plt.plot()
plt.show()
    

Question: What kind of distribution do the Loan Terms have ?¶

In [64]:
draw_countplot('Term','Plot Showing the Distribution of Loan Terms')

From the plot above, it seems medium term loans are the most disbursed loans followed by the long term loans. Without ordering the bars in the plot, the distribution is unimodal and the distribution appears to be normal that is, not skewed to either direction

What income range applied for loans the most ?¶

In [65]:
loans_copy.IncomeRange.value_counts()
Out[65]:
$25,000-49,999    32192
$50,000-74,999    31050
$100,000+         17337
$75,000-99,999    16916
Not displayed      7741
$1-24,999          7274
Not employed        806
$0                  621
Name: IncomeRange, dtype: int64
In [66]:
type_order = loans_copy.IncomeRange.value_counts().index
sns.countplot(data=loans_copy, y='IncomeRange', order=type_order, color='blue');
plt.title(' Distribution of Income Range');

From the plot above, it is obvious that most of the loans fall within the income range of 25,000 - 49,999 followed by the 50,000 - 74,999 income range. It would be safe to say most of the loan applicants are middle-income earners.

What is the employment status of most of the applicants ?¶

In [67]:
loans_copy.EmploymentStatus.value_counts()
Out[67]:
Employed         67322
Full-time        26355
Self-employed     6134
Not available     5347
Other             3806
Part-time         1088
Not employed       835
Retired            795
Name: EmploymentStatus, dtype: int64
In [68]:
type_order = loans_copy.EmploymentStatus.value_counts().index
sns.countplot(data=loans_copy, y='EmploymentStatus', order = type_order, color='blue')
plt.title('Plot Showing Distribution of Employment Status');

The plot above shows that most of the loan applicants are employed. This should make sense because one of the terms for eligibility for a loan should be a means of paying the loan which is having some form of employment

What is the loan status of the applicants ?¶

In [69]:
draw_countplot('LoanStatus','Plot Showing the Distribution of Loan Status')

From the plot above, it is obvious that most of the loans are still being currently serviced. However, the next highest metric shows that 38,074 loans have been completed. It may indicate a strong possibility that most of the loans that are current would be completed.

What state do most of the borrower's come from ?¶

In [70]:
loans_copy.BorrowerState.value_counts()
Out[70]:
CALIFORNIA              14717
TEXAS                    6842
NEW YORK                 6729
FLORIDA                  6720
ILLINOIS                 5921
NOT SPECIFIED            5515
GEORGIA                  5008
OHIO                     4197
MICHIGAN                 3593
VIRGINIA                 3278
NEW JERSEY               3097
NORTH CAROLINA           3084
WASHINGTON               3048
PENNSYLVANIA             2972
MARYLAND                 2821
MISSOURI                 2615
MINNESOTA                2318
MASSACHUSETTS            2242
COLORADO                 2210
INDIANA                  2078
ARIZONA                  1901
WISCONSIN                1842
OREGON                   1817
TENNESSEE                1737
ALABAMA                  1679
CONNECTICUT              1627
SOUTH CAROLINA           1122
NEVADA                   1090
KANSAS                   1062
KENTUCKY                  983
OKLAHOMA                  971
LOUISIANA                 954
UTAH                      877
ARKANSAS                  855
MISSISSIPPI               787
NEBRASKA                  674
IDAHO                     599
NEW HAMPSHIRE             551
NEW MEXICO                472
RHODE ISLAND              435
HAWAII                    409
WEST VIRGINIA             391
DISTRICT OF COLUMBIA      382
MONTANA                   330
DELAWARE                  300
VERMONT                   207
ALASKA                    200
SOUTH DAKOTA              189
IOWA                      186
WYOMING                   150
MAINE                     101
NORTH DAKOTA               52
Name: BorrowerState, dtype: int64
In [71]:
type_order = loans_copy.BorrowerState.value_counts().index
plt.figure(figsize=[20,5])
sns.countplot(data=loans_copy, x='BorrowerState', order=type_order)
plt.xticks(rotation=90);
plt.title('Plot showing Distribution of Borrowers States');

The plot above shows that most of the lenders are from the state of California with Texas a distant second. This could be indicative of spending practices or the income levels of the people of these states. More detail on this could be gotten from subsequent bivariate oe even multivariate plots

What are the occupations of the loan applicants ?¶

In [72]:
loans_copy.Occupation.value_counts()
Out[72]:
Other                          28617
Professional                   13628
Computer Programmer             4478
Executive                       4311
Teacher                         3759
                               ...  
Dentist                           68
Student - College Freshman        41
Student - Community College       28
Judge                             22
Student - Technical School        16
Name: Occupation, Length: 67, dtype: int64
In [73]:
type_order = loans_copy.Occupation.value_counts().index
plt.figure(figsize=[20,5])
sns.countplot(data=loans_copy, x='Occupation', order=type_order)
plt.xticks(rotation=90);

The plot above shows that most of the loan applicants have their occupations defined as Other. In other words, their occupation was not captured in the list of occupations. It is highly probable that these people are business owners as there is a nothing to account for business owners in this list

What is the distribution of the Credit Grades ?¶

In [89]:
plt.figure(figsize=[20,5])
sns.countplot(data=loans_copy, x='CreditGrade', order=['A','E','C','AA','B','D','HR'], color='blue')
plt.xticks(rotation=90);
plt.title('Plot showing Distribution of Credit Grades');
In [87]:
loans_copy.CreditGrade.value_counts()
Out[87]:
C     5649
D     5153
B     4389
AA    3509
HR    3508
A     3315
E     3289
NC     141
Name: CreditGrade, dtype: int64

The plot shows that most of the credit grades of the loans are actually of rating C followed by D. It also shows that there is a normal distribution of the credit grades of the borrowers and it is unimodal with Credit Grade C being the highest

What is the distribution of their verifiable source of income ?¶

In [75]:
draw_countplot('IncomeVerifiable','Plot Showing the Distribution of the Verifiable Income of Borrowers')

The plot above shows that most of the borrowers (105,268) had a verifiable source of income. We will see if this plays a role in the granting of loans

In [76]:
draw_countplot('ProsperScore','Plot Showing the Distribution of Prosper Score Ratings')

The plot shows a normal distribution of the ProsperScore rating. Also, it seems that the highest Prosper Score ratings are 4.0, 6.0 and 8.0.

Observations¶

From the plots, a lot of insights were gotten. Firstly, for the loan terms, for the sake of clarity, I had to convert the terms, 12 months, 36 months and 60 months into short, medium and long term respectively. Then this was converted to categorical data type and from the plot, it can be seen that the distribution is unimodal with medium term loans having the highest number 0f 87778 borrowers.

For the income range, it was interesting to see that most of the borrowers fall withing the 25,000-49,999 range which makes up 32,192 borrowers followed by 50,000-74,999 which makes up 31,050. The data had to be converted to the categorical type for easier analysis.

Also, most of the borrowers seem to be gainfully employed with 67,322 of all the borrowers being employed. This accounts for 60 % of all the borrowers. The arrangement of the bars in the chart have been changed so that they can be shown in descending order. This helps in bringing more clarity to the plot.

The loan status plot shows that majority of the loans are being currently serviced. Before, plotting the chart for this, first I had to merge all past due loans under one category for easier analysis. Then, I had to convert the data type to the category data type for easier analysis. The plot shows 56,576 borrowers are currently servicing their loans with 38,074 of the loans being paid off completely. Those two metrics are an indicator that most loans from Prosper Loan will be paid regularly and completed.

For the borrower's state, firstly I had to change the state abbreviations to full names for more clarity. Then, that column was converted to the categorical data type. From the plot, it is obvious that most of the borrowers are from the state of California which accounts for 13% of all lenders alone. Texas comes a distant second with 6% of all lenders.

Then, for the Prosper Score rating, which is a measure of the risk level of the borrower. First, this had to be converted to categorical data type and then plotted. Since this was an ordered data set, then we did not need to arrange it in ascending or descending order. The plot shows a normal distribution of the Prosper Score rating with the ratings, 4, 6 and 8 having the highest number of borrowers.

Bivariate Exploration¶

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

In [77]:
def bivariate_count_plot(x, hue, title):
    plt.figure(figsize=(15,10))
    ax = sns.countplot(x = x, hue = hue, data =loans_copy)
    for p in ax.patches:
      ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x()+0.15, p.get_height()+0.05))
    plt.suptitle(title)
    plt.xticks(rotation = 90)
    ax.margins(y=0.1)
    plt.tight_layout()
    plt.plot()   
plt.show();
In [78]:
bivariate_count_plot('ProsperScore', 'Term', 'Plot of Prosper SCore against Loan Term')

From the clustered bar chart above, it would seem that medium term loans are in the highest demand, irrespective of the Prosper Score. It depicts that irrespective of the level of risk a borrower poses, borrowers seem more inclined towards medium term loans.

In [79]:
bivariate_count_plot('ProsperScore', 'IncomeRange', 'Plot of Prosper Score against Income Range')

From the plot, it appears that as the Prosper Score increases, there is also a commensurate increase in the income ranges across all income ranges. Especially when we look at Propser Scores from 4.0 to 8.0, this is more evident. This is an indication that income plays a key role in determining the risk level of the borrowers

In [80]:
bivariate_count_plot('LoanStatus', 'IncomeVerifiable', 'Plot of Loan Status against Verifiable Income')

From the plot above, it shows that having a verifiable income pays in a key role in determining the status of a loan. For instance, we can see that most of those with a verifiable source of income have either current loans or have completed the the payment of their loans. Even though, other loan status' also have borrowers with a verifiable source of income, these two variables can help the bank in determining the type of borrowers to lan money to

In [81]:
bivariate_count_plot('EmploymentStatus', 'IncomeVerifiable', 'Plot of Employment Status against Verifiable Income')

The plot shows that employment status is a key indicator of having verifiable income. Also, being gainfully employed seems related to having a verifiable source of income. Case in point, the muber of people employed and having full-time employment seem to have the highest numbers, 65902 and 25552. Also, it shows at the tail of the graph that most of the borrowers that are self-employed do not have a verifiable source of income

In [82]:
bivariate_count_plot('EmploymentStatus', 'IncomeRange', 'Plot of Employment Status against Income Range')
In [83]:
bivariate_count_plot('EmploymentStatus', 'IncomeRange', 'Plot of Employment Status against Income Range')

Looking at this plot of employment status and income range, it shows that employed and full time employees that applied for loans fall mostly within the 25,000 - 75,000 income range. This shows that most ot the borrowers are employed and earn between 25,000 and 75,000. This could help in knowing the demographics of the borrowers

Observations and Observed Relationships¶

Firstly, a plot of the Prosper Score along with the Loan terms was done using a clustered bar chart. From the charts, we can see that most people, irrespective of their Prosper Score, seemed more inclined towards medium term loans. This was a bit surprising because I would expect that people with a lower risk score, who would be more financially buoyant, would like to borrow more long term loans while higher risk borrowers would gravitate towards short term loans.

For the plot of Income Range and Prosper Score, it can be seen that from Prosper Score 1 to 5, there seem to be more borrowers within the 25,000 to 49,900 USD range. However, from Prosper Score 6 t0 9, it seems that most of the borrowers fall within the 50,000 to 75,000 USD and as we move more towards the highest Prosper score, we have people more in the 100,000+ income range. This points out that the higher your income range, then the better your Prosper Score.

For the plot of Verifiable Income against Loan Status, it shows that irrespective of your loan status, having a verifiable source of income goes a long way in making sure you get the loan in the first place since all the loan status' have a higher number of borrowers with verifiable income compared to borrowers with no verifiable income.

For the plot of the Employment Status and Verifiable Income, it shows that being employed goes a long way in determining if you have a verifiable source of income and increases your chances of getting a loan. However, for those unemployed, unsurprisingly, there are more borrowers with an unverifiable source of income. For those self-employed though, surprisingly, there are more borrowers with unverifiable income compared to verifiable income. I would expect that self employed people would track their business better and so their income can be verified.

For the plot of Employment Status and Income Range, it seems for the employed, full-time and self employed borrowers, most of their borrowers fall within the 25,000 to 49,900 USD range and 50,000 to 75,000 USD. However, part time and retired, their income range is from 1-24,999.

Multivariate Exploration¶

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

In [84]:
def draw_scatter(hue, title):
    plt.figure(figsize=(15,10), dpi=400)
    sns.scatterplot(x = 'LoanOriginalAmount', y = 'MonthlyLoanPayment', hue = hue, data = loans_copy)
    plt.xlabel('Loan Original Amount (USD)')
    plt.ylabel('Monthly Loan Payment (USD)')
    plt.title(title);
In [85]:
draw_scatter('Term', 'Original Loan Amount Against Monthly Loan Payments by Loan Term.')

By looking at the plot, it shows a positive correlation between the loan amounts and the monthly payments . However, the slope of each term seems to get steeper as the term get shorter. For instance, we can see that for short term loans,the amount paid monthly seems higher compared to medium or long term loans. This may have something to do with the timeframe and the bank wanting to make its money back as quickly as possible within the short timeframe

In [86]:
draw_scatter('EmploymentStatus', 'Original Loan Amount Against Employment Status.')

Looking through the plot, we can see most of the borrowers are employed and as there is a positive correlation between the monthly loan payment and the original loan amount

In [ ]:
 
In [ ]:
 
In [ ]:
 

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

Comparing the Original Loan Amount against the Monthly Payments for the 3 types of loan terms, it can be seen that there is a positive correlation between loan amount and monthly payments for all 3 loan terms. However, it seems more pronounced when we come to short term loans than any other loan terms. The reason for short term loans being more pronounced is not too clear. However, I would assume that it had more to do with the fact that due to the shortness of the duration of the loan, more money has to be paid monthly to meet up with the deadline

Comparing the Original Loan Amount against the Monthly Payments based on Employment Status, it can be seen that borrowers that were fully employed seemed to pay a higher amount monthly compared to their original loan amount. It in some ways relates to the previous plots that showed a high amount of people gravitating towards medium and short term loans. That is a bit surprising.

Conclusions¶

From the questions asked and the analysis and visualizations, it can be deduced that income range plays a key role in the granting of loans to prospective borrowers. The higher your income range, the lesser you pos e a risk to the bank, that is, your Prosper Score is higher. Also, employment status is a key determinant in selecting borrowers. Though, it seems in the dataset, there are a lot of borrowers that fall under the others employment status, it could be more defined subsequently to allow for better analysis.

In [ ]: